Abstract

As part of BioMed Central's open science mission, we are pleased to announce that two of our journals have 
integrated with the open data repository Dryad. Authors submitting their research to either BMC Ecology or BMC 
Evolutionary Biology will now have the opportunity to deposit their data directly into the Dryad archive and will 
receive a permanent, citable link to their dataset. Although this does not affect any of our current data deposition 
policies at these journals, we hope to encourage a more widespread adoption of open data sharing in the fields of 
ecology and evolutionary biology by facilitating this process for our authors. We also take this opportunity to 
discuss some of the wider issues that may concern researchers when making their data openly available. Although 
we offer a number of positive examples from different fields of biology, we also recognise that reticence to data 
sharing still exists, and that change must be driven from within research communities in order to create future 
science that is fit for purpose in the digital age. 

This editorial was published jointly in both BMC Ecology and BMC Evolutionary Biology. 



Background 

This week we announce the integration of BMC Ecology 
and BMC Evolutionary Biology with the data repository 
Dryad. The hope behind this integration is not just to 
encourage authors to open up the data behind the articles
they publish with us, but to facilitate it. Although
the Dryad repository hosts research data from across all 
fields of science and medicine, it has been among the eco- 
logical and evolutionary biology research communities 
that deposition has most frequently been taken up [1]. It 
is for this reason that we have targeted these journals 
specifically, with a view to extending integration to 
other fields in the future. 

On a practical level, what does this integration mean? If 
an author submits a paper to either of the aforementioned 
journals, they will receive an email with a one-time only 
link to Dryad with instructions on how to deposit their 
data, and how and where to cite the dataset in their paper 
using best practices from DataCite [2]. Once the paper is 
published, we at BioMed Central will notify Dryad, and 
they will update their records accordingly. 
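As a rough illustration of what citing a deposited dataset involves, the sketch below assembles a citation string loosely following the "Creator (PublicationYear): Title. Publisher. Identifier" pattern recommended by DataCite. The `DatasetRecord` class, the `format_citation` function, and every value in the example record (including the DOI) are hypothetical placeholders, not part of the Dryad or BioMed Central systems.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Minimal fields needed for a DataCite-style data citation (hypothetical)."""
    creators: list[str]
    year: int
    title: str
    publisher: str
    identifier: str  # a resolvable identifier, typically a DOI URL


def format_citation(record: DatasetRecord) -> str:
    """Build 'Creator (Year): Title. Publisher. Identifier' from a record."""
    authors = "; ".join(record.creators)
    return (f"{authors} ({record.year}): {record.title}. "
            f"{record.publisher}. {record.identifier}")


# Placeholder record for illustration only; the DOI is not a real dataset.
example = DatasetRecord(
    creators=["Author A", "Author B"],
    year=2014,
    title="Data from: Example article title",
    publisher="Dryad Digital Repository",
    identifier="https://doi.org/10.5061/dryad.example",
)
print(format_citation(example))
```

The resulting string would go in the reference list of the paper, so that the dataset can be cited and tracked in the same way as an article.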

This does not mean we are changing the data-sharing 
policies of BMC Ecology and BMC Evolutionary Biology, 
at least for the moment. Like all journals published by
BioMed Central, we strongly encourage all of our 
authors to archive, and make openly available, the data 
underlying their article. However, in the light of this 
update, we felt that this might also be a useful opportunity 
to speak to our authors about data policy more generally, 
in the hope of raising greater awareness of some of the 
major issues surrounding the debate. 

The role of the publisher in data availability is some- 
thing many publishers, especially open access publishers, 
have been discussing at least internally if not also exter- 
nally. Many reading this will be familiar with the recent 
discussion around PLoS's own change in policy, requiring
that authors publishing with a PLoS journal make the data
underlying the study publicly available (with rare
exception) and note their compliance with this in
a Data Availability Statement [3,4]. At BioMed Central
our policy states that "submission of a manuscript to a 
BioMed Central journal implies that readily reproducible 
materials described in the manuscript, including all rele- 
vant raw data, will be freely available to any scientist 
wishing to use them for non-commercial purposes" [5]. 
The idea that the data underlying a study should be 
available for validation of its conclusions is not unrea- 
sonable and is, indeed, a condition of submission for 
most respectable journals. It has to be. 

In reality, however, any enforcement of sharing data is
normally a private matter, with individual researchers con- 
tacting either individual authors or the publisher to 
request access. Many researchers are happy with such 
sharing, often welcoming the spur to collaboration it 
provides. However, the practicalities of tracking down data 
in this way are highly problematic. Private hard drives are 
not reliable; nor are they persistent. Similarly, researchers 
move, change jobs, and so on. Publicly available data 
housed in a repository removes the burden from the re- 
searcher to maintain and privately share his or her data. 
Indeed, sharing one's data can be seen as a matter of 
convenience as well. It can be far less hassle to deposit 
your data immediately, only having to remember your 
name and the repository you put it in. In addition, "behind 
closed doors" sharing creates an inequality, as Poisot, 
Mounce, and Gravel note: "...those with good contacts 
have access to datasets, while others are left out" [6]. 

Yet the practice of making data publicly available varies
widely across fields and researchers. Indeed, even some strong
open access advocates have at least questioned data 
sharing as a policy. Proprietary and clinical data aside, 
why is this the case? 

Opportunity cost 

Dr Erin McKiernan, a neurophysiologist working in 
Mexico and a strong open access advocate, points to a 
lack of funding for developing-world researchers and the 
practical implications of sharing data at the time of first 
publication, when that data is needed to sustain that lab 
through the publication of papers for 3 to 5 years to 
come [7]. Of course, being scooped is a huge fear of
researchers, but what about the opportunity cost of forgoing
the increased collaboration, or the extra data made
available to researchers in the developing world, that
greater data sharing would bring? Indeed, Dr McKiernan does
recognise in her comments the possible benefit of
supplementing her own data with other types of data: "...open
electrophysiological or epidemiological data would cer- 
tainly help me to improve the models I use in my work. I 
can also think of examples in which a lab could extend or 
support their smaller primary data set with open data." 

Ecological and evolutionary science has a long history 
of conducting research in less developed countries, 
partly because these areas of the world also happen to 
harbour its richest biological diversity [8]. Researchers 
from developed economies conducting research in these 
parts of the world will no doubt recognise the difficulties 
that their collaborators face in accessing the full gamut 
of resources needed to conduct quality research — from 
basic equipment to access to literature. The same is also 
true of data. Like the Declaration of Helsinki [9] in 
medical research, which states that research should 
benefit the populations on which it is conducted, many
biologists are now recognising that a basic prerequisite
of acquiring data from emerging economies should be 
that it is accessible to researchers in those countries 
[10]. Similarly, where research into applied problems 
stands to influence policy decisions in these countries, 
more attention and support need to be given to local
researchers [11].

Shared benefits 

Although the infrequency of data sharing within many 
research fields makes it difficult to point to examples of 
the benefits of data sharing on collaboration, some 
examples can be seen in the genomics community, which 
has a lengthier history of sharing data. For example, the
release of a microbiome dataset [12] resulted in a collaboration
with the Agency for Science, Technology and
Research (A*STAR), which is currently using this data to
build a new generation of tools for microbiome data. The
publication of these tools will help to make this dataset 
the gold standard and reference for microbiome data, thus 
highlighting the authors' research. 

In a 2005 article published in Genome Biology,
authors using the Trace Archive (a repository for raw,
unanalyzed genomic sequencing data) discovered three
new species of the bacterial endosymbiont Wolbachia
pipientis in three different species of fruit fly: Drosophila
ananassae, D. simulans, and D. mojavensis [13]. The
study shone a light on the benefits to researchers of
having publicly available raw data.

A final star example demonstrating the benefits to
collaboration and the increased pace of science when we
share comes from the 2011 E. coli O104:H4 outbreak in
Europe, in which over 3,500 people fell ill (resulting in
53 deaths). What marks this story as particularly inspiring
is its break from the usual scientific procedure of
data production, data analysis, and then publication after 
a long process of peer-review. Due to the severity of the 
outbreak, the Beijing Genomics Institute (BGI) immedi- 
ately released the full genome sequence of the strain 
within 5 days of receiving the genomic data of the 
outbreak sample. News of the release of the full genome 
sequence data was then aired via Twitter. Within 
24 hours a GitHub repository had been created and 
further analyses were subsequently crowdsourced [14]. 
Within a couple of days, a potential ancestral strain had 
been found. Such rapid genomic analyses allowed for the 
origins and nature of the pathogen to be much better 
understood [15]. The story also exemplifies a crucial point
to be made regarding scientific credit and etiquette, and
the sharing not only of data but also of the analysis of
that data.

The open source analysis for the outbreak was published
in the New England Journal of Medicine [16], showing that
faster data dissemination and analysis through sharing
need not undermine traditional scientific structures
of credit.

Blood, sweat and tears 

Of course, the genomics community, with its longer 
history of data sharing, has some strong positive exam- 
ples where data sharing has benefitted researchers, but 
sequencing data can differ greatly from species abun- 
dance or behavioural trait data, for example. A key point 
to make of the genomics community, especially with the 
deposition of raw data, is that these data might not be as 
"hard won" as datasets in other fields. Indeed, it is not 
unheard of for certain genomics institutes to produce so
much data that they will never be able to write up
papers for all of it.

An ecological dataset can last a researcher many years 
and many papers. A question we also must ask then is 
how will the amount of and type of data produced 
change when researchers are guaranteed only one paper 
from that dataset? Will there be an incentive to collect 
those more "hard won" datasets? 

Many ecologists will be all too familiar with "pouring 
their blood, sweat, and tears" into a dataset, perhaps 
having gathered the data through years of field work, 
possibly having developed and maintained unique field 
sites themselves (e.g., establishing nest boxes to encour- 
age birds to remain at their field sites, or long-term 
monitoring of plant communities under different experi- 
mental treatments). 

For such datasets it is important to note a few things. 
First, one can deposit (and thus gain credit for) a dataset
at early stages. One could release small, perhaps yearly, 
versions of the dataset, accruing many individual research 
products over the course of the experiment's lifespan. 
Indeed, one's research products are only ever versions of 
the entire story of one's work. Will these be as useful as a 
dataset collected over 30 years? Probably not. But they will 
reflect your productivity as a scientist. And when you do 
publish your dataset collected over 30 years, yes, someone 
could use it — as you could use someone else's, allowing 
you to compare and contrast an unlimited resource 
of data. 

A bigger picture 

Consider the papers that could emerge by combining 
your dataset with datasets that were previously inaccess- 
ible. The emergence of new fields such as macroecology, 
dedicated to the analysis of large-scale multispecies 
datasets, relies on the availability of disparate sources 
of data in order to uncover broad patterns in eco- 
logical and evolutionary processes. Integrating these 
data across different scales of space and time is cer- 
tainly a challenge — but so too is getting access to this 
data in the first instance [17]. Only by creating stronger
community standards for access to, and annotation of,
this data will higher-quality analysis be achievable in
the future.

Some might argue collaboration will decrease. Why 
would you be contacted if your data is already out there, 
free to reuse? In the genomics community, collaboration 
has come from unintended reuses of that data. The 
microbiome data previously mentioned is one such 
example. 

There is no reason why the same cannot be true 
among ecologists. What is needed, however, are clearer 
guidelines in terms of communication and etiquette on 
what is expected of researchers who choose to reuse 
data, and it is important to note that the positive exam- 
ples of reuse that have been mentioned here involved 
proper communication with, and recognition of, the original
data producer [18]. It is understandable that many
researchers will be apprehensive about what could be 
perceived as a loss of control over their data. However, 
as many biologists now recognise, the benefits that
data archiving can bring to the field offset many of these 
perceived fears, and may bring about new collaborative 
opportunities. 

Data management 

Some of the more prominent data sharing communities, 
like genomics, also have a fairly standardised way of 
presenting data. Ecological and evolutionary data is typ- 
ically very difficult to standardise, since it can be highly 
heterogeneous. The diversity of sub-fields collecting data
on very different scales of grain, extent, and time (from
marine microbes to whole terrestrial ecosystems) makes
these highly challenging disciplines to integrate. This is
not to say it cannot be done, or that it shouldn't be 
done, but rather to indicate that many fields are starting 
in a very different place than the genomics community. 

A recent view into the future of biodiversity research 
puts open data at the top of a list of priorities facing the
"grand challenge" of making sense of the current ecological
data deluge, but recognises that much improved
infrastructure and standardisation are needed to meet
this challenge [19]. A key component of this will be
better encoding and structuring of different forms of 
data through the use of controlled vocabularies and 
ontologies to ensure data are machine-readable and 
human-understandable. Many barriers still exist to the im- 
plementation of the recommendations, but the infrastruc- 
ture for allowing it to happen is emerging [20]. 
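
To make the idea of a controlled vocabulary concrete, the following is a minimal sketch of checking a metadata record against a fixed list of permitted terms before deposition. The vocabulary, the field names, and the `validate_record` function are all invented for illustration; real ontologies and repository schemas are considerably richer.

```python
# Hypothetical controlled vocabulary: the permitted terms for each metadata field.
CONTROLLED_VOCABULARY = {
    "habitat": {"marine", "freshwater", "terrestrial"},
    "data_type": {"species_abundance", "behavioural_trait", "sequence"},
}


def validate_record(record: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, allowed in CONTROLLED_VOCABULARY.items():
        value = record.get(field)
        if value is None:
            problems.append(f"missing field: {field}")
        elif value not in allowed:
            problems.append(f"{field}: '{value}' is not a recognised term")
    return problems


# A conforming record and a non-conforming one (both made up).
print(validate_record({"habitat": "marine", "data_type": "sequence"}))
print(validate_record({"habitat": "ocean"}))
```

Because every record uses the same agreed terms, datasets from different studies can be searched and combined by software as well as read by people.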

Better data management will be essential, and will
need to be written into grant applications and recognised
by funders. A partner organisation of BioMed Central
making much headway in this area is the open source
metadata tracking project ISA Tools [21]. These tools can
be applied across the life sciences to help better describe
rich metadata, making your dataset and study more
reusable and reproducible. These tools are
more appropriate for some studies than others, but they 
continue to be built on and represent a good starting 
point. This is not to say that more refined data manage- 
ment won't mean more work at least in the short term. 
But perhaps data management is a skill required of a 
21st century scientist. As ecologist Edmund Hart con- 
cludes in his blog on the subject, "I think we just need 
to own up to the fact [that] being a scientist these days 
requires new skills... In the 1990's how many ecologists 
could do a mixed-effects model? Now I see them all the 
time. In the 21st century to do science better, we need 
more than spreadsheets with a few rows, we need to im- 
plement best practice for data management" [22]. 

However, even when a relatively standardised ap- 
proach to data formatting and archiving exists, there can 
still be problems in ensuring data is properly archived 
and deposited in a way in which other users may easily 
re-use it. The phylogenetic tree repository TreeBase is 
the most widely used archive for this type of data, and 
has been known to the evolutionary biology community 
for many years, with many journals in the field stating 
deposition of data here as a requirement for publication. 
Yet even with this community-wide adoption, a recent 
analysis of the literature in this field found that only a 
small fraction of data was made publicly available by 
authors [23]. Even among those datasets that were made 
available, inconsistencies in formatting and labelling 
mean that this fraction is reduced further such that only 
a tiny amount of usable data is truly available even when 
the right technical infrastructure is in place. 

The situation may be even worse in ecology, where
datasets are typically much more variable and few dedicated
repositories exist. Estimates of data discoverability in
the ecological literature are even starker than in
evolutionary biology, with perhaps as little as 1% of data
being accessible after publication [24].

Credit where credit is due 

In terms of benefits to authors, many point to the
additional citable research product you gain in the form of a
dataset. Some have remarked that a citation to a dataset
"isn't much credit at all" [25], pointing to the perhaps
disappointing truth that among funders and universities,
papers are still the highest form of "productivity".
Organisations like Mozilla Science Lab are working with
organisations like GitHub to counteract this [26], but as
things stand, until all funders and universities
truly begin to value all research objects, this hierarchy
will remain in place.

In addition to strong data citation guidelines, one an- 
swer we see at BioMed Central is to help researchers get 
credit for their data and encourage its reuse (and thus
future citation) through more traditional lines of credit,
such as the article. Data notes as an article type are 
available in many of our journals, like GigaScience and 
BMC Research Notes. A data note focuses on the data 
(the methodology behind it, its validation, its reuse 
potential) rather than the conclusions found after ana- 
lysing it. It also offers a chance for a dataset to be peer- 
reviewed. In this way, an author can validate his or her 
study by shining a light on the validity, and strong reuse 
potential, of the data behind the study. Recognising the 
potential of this will, of course, require a shift in the 
perception of what constitutes a valuable contribution to 
scientific output, and a shift in the role that scientific 
publishing can play in ensuring that the heterogeneous 
data of ecology and evolutionary biology are fit for pur- 
pose in the digital age [27]. 

Making data publicly available is also a way for
authors to gain an additional research output, and of
course, datasets are now recognised by the National 
Science Foundation [28] and other funders as a research 
product to be included in grant proposals. Studies have 
also shown that publicly available data connected to an 
article is associated with an increased citation rate 
[29-31]. Indeed, Piwowar and Vision recently found the 
increased rate can be as much as 30%, depending on the 
length of time the data has been public [32]. Publicly
available data thus points strongly to increased research
impact for individual researchers.

Transparency and trust 

Another incentive behind sharing data is, of course, 
the validation of research. In November 2009 a
hacker entered the computer system at the Climatic
Research Unit at the University of East Anglia and
exposed emails and documents that were widely reported as
showing climate scientists not only distorting data to
exaggerate evidence of global warming but also refusing to
share raw data with critics of their work. In February 2010
a poll found a
30% drop in the past year in the percentage of British 
adults believing in climate change [33]. Incidents like
"Climategate" are not only damaging to the reputation of
all scientists but are also a detriment to public understanding
of science, which is often the evidence base
for important policy decisions, such as in the case of 
climate change. 

The open availability of data ensures transparency and 
traceability of results, which may be checked by anyone 
wishing to do so. For ecological science, this is especially 
important since researchers working on field-based studies 
may have far greater difficulty in replicating experi- 
ments under differing environmental conditions than 
would be the case under a controlled laboratory environ- 
ment [24]. Development of standardised metadata to 
trace provenance, especially for studies integrating
many disparate data sources, will be crucial in ensuring
future science meets the highest standards of
quality and reproducibility.

Kenall et al. BMC Evolutionary Biology 2014, 14:66
http://www.biomedcentral.com/1471-2148/14/66

Concluding remarks 

We are now in an age where communication across geo- 
graphic and cultural barriers has been facilitated like never 
before, and there is little excuse for why the adoption of 
community standards for better data management cannot 
be achieved. There is also little excuse for continuing with
the loss of ecological and evolutionary data that
preceded the digital age. Think of the value that access
to the past century of ecological and evolutionary litera- 
ture would have to researchers working today, the many 
thousands of labour-hours expended collecting biological 
knowledge across many scales of time and space. Prevent- 
ing the loss of this knowledge for future generations of 
biologists depends on the decisions of the research com- 
munity — and it's never been easier than now. 

Stories, positive and negative, point to the benefits of 
data sharing, but we won't know all the benefits, nor 
exactly how data sharing will change the way researchers 
practise science, until sharing data becomes standard.
Meanwhile, as a publisher we're in a difficult position. On 
the one hand, as an open access publisher, a major drive 
behind nearly everything we do is to make publishing re- 
search easy and painless. Publishing is a service to an au- 
thor. On the other hand, we are driven by an open science 
mission that we believe not only makes better science but 
a better world. We are still discussing internally what this 
means for our data sharing policy, but in the meantime 
we are excited to see the recent discussion around open 
data taking place and encourage our authors to voice their 
own thoughts on the matter in the comments below. 

It seems likely that meeting the challenges facing the
natural world in the Anthropocene era will require large-scale
global collaboration among researchers across ecology
and evolutionary biology. We hope that the long-term
benefits of opening up access to data for everyone will
outweigh some of the shorter-term difficulties of
data sharing, and strongly encourage all of our authors to 
make their data openly available. We're ready to work 
with researchers to ensure that facilitating this is made 
possible across the board, and pleased to endorse new 
initiatives, like our integration with Dryad, that seek to 
make this happen. 